Optimizing Theta-Joins in a MapReduce Environment

نویسندگان

  • Changchun Zhang
  • Jing Li
  • Lei Wu
چکیده

Data analyzing and processing are important tasks in cloud computing. In this field, the MapReduce framework has become a more and more popular tool to analyze large-scale data over large clusters. Compared with the parallel relational database, it has the advantages of excellent scalability and good fault tolerance. However, the performance of join operation using MapReduce is not as good as that of parallel relational database. Thus, how to optimize theta-join operations using MapReduce is an attractive point to which researchers have been paying attention. In this paper, a randomized algorithm named Strict-Even-Join(SEJ) is designed to solve the multi-way theta-joins in a single MapReduce job. Moreover, a dynamic programming algorithm is elaborated to optimize the multi-way theta-joins by calling the SEJ algorithm. The results of experiments show that our approach is feasible and effective.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Binary Theta-Joins using MapReduce: Efficiency Analysis and Improvements

We deal with binary theta-joins in a MapReduce environment, and we make two contributions. First, we show that the best known algorithm to date for this problem can reach the optimal trade-o↵ between the size of the input a reducer can receive and the incurred communication cost when the join selectivity is high. Second, when the join selectivity is low, we present improvements upon the state-o...

متن کامل

A Theoretical and Experimental Comparison of Filter-Based Equijoins in MapReduce

MapReduce has become an increasingly popular framework for large-scale data processing. However, complex operations such as joins are quite expensive and require sophisticated techniques. In this paper, we review state-of-the-art strategies for joining several relations in a MapReduce environment and study their extension with filter-based approaches. The general objective of filters is to elim...

متن کامل

Evaluation of view maintenance with complex joins in a data warehouse environment

Data warehouse maintenance and maintenance cost has been well studied in the literature. Integrating data sources, in a data warehouse environment, may often need data cleaning, transformation, or any other function applied to the data in order to integrate it. The impact on view maintenance, when data is integrated with other comparison operators than defined in theta join, has, however, not b...

متن کامل

Efficient Processing Distributed Joins with Bloomfilter using MapReduce †

The MapReduce framework has been widely used to process and analyze largescale datasets over large clusters. As an essential problem, join operation among large clusters attracts more and more attention in recent years due to the utilization of MapReduce. Many strategies have been proposed to improve the efficiency of distributed join, among which bloomfilter is a successful one. However, the b...

متن کامل

Efficient Large Outer Joins over MapReduce

Big Data analytics largely rely on being able to execute large joins efficiently. Though inner join approaches have been extensively evaluated in parallel and distributed systems, there is little published work providing analysis of outer joins, especially on the extremely popular MapReduce platform. In this paper, we studied several current algorithms/techniques used in large outer joins. We f...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2013